Course: Applied Geo-data Science at the University of Bern (Institute of Geography)
Supervisor: Prof. Dr. Benjamin Stocker
Advisers: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider
Further information: https://geco-bern.github.io/agds/
Do you have questions about the workflow? Contact the Author:
Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)
Matriculation number: 20-100-178
Reference: Report Exercise 4 (Chapter 9)
This exercise is an introduction to supervised machine learning. Using a realistic problem, the following skills are practised:
Implementation of the KNN algorithm
Discussion of the bias-variance trade-off
The role of k
Supervised machine learning is a type of machine learning in which a model is trained on labelled data so that it predicts new, unseen data as accurately as possible (with the lowest error on the new data). This approach has some challenges: for example, we need to know how good the predictions are. Therefore, we use only part of the labelled data for training; with the remaining labelled data, we quantify the error.
There are many ways to train an algorithm. Here we use polynomials to approximate the target. If we use a polynomial with $deg = 0$, we describe the data with just an intercept, which is probably a very poor choice: the bias will be very high, since we will hit almost no target value exactly, and the model explains only very little of the target's variance. The higher the degree of the polynomial, the more flexibility we have to bend, stretch, or compress our function. In the best case, we find a polynomial that hits every target value in the training data exactly; the bias in the training data is then zero. We might be tempted to say that the best model is the one that describes the training targets best, but this is not true.
Consider the following: if we pick the model that approximates the training targets best, we also fit all the errors. Target and predictors come from measurements and therefore always carry errors (always statistical, and almost always systematic, errors: measurement noise). With a polynomial of $deg = 0$, we will not get a good prediction: the model describes the training targets so poorly that it cannot predict well either. The RMSE is high in the training data and will also be high in the test data. We call this an underfitted model; the bias-variance trade-off shifts towards a very high bias, and this high bias keeps the RMSE high in the predicted data as well.
If we use a polynomial with \(deg = n\), we project the measurement noise into the new data. That causes a higher variance in the predicted data because our model is overfitted; the bias-variance trade-off in the training data shifts towards the variance. The goal is to find the model with the best generalisability, which is the model with the best bias-variance trade-off. This exercise shows how we can find such a model with a KNN algorithm.
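The under- vs. overfitting effect described above can be reproduced with a few lines of R on synthetic data (everything here — the sine signal, the noise level, and the chosen degrees — is invented purely for illustration, not part of the flux analysis below):

```r
# Sketch: under- vs. overfitting as a function of polynomial degree
set.seed(42)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)   # true signal + measurement noise
train_idx <- sample(50, 35)
train <- data.frame(x = x, y = y)[train_idx, ]
test  <- data.frame(x = x, y = y)[-train_idx, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

for (deg in c(0, 1, 3, 15)) {
  # deg = 0 is the intercept-only model; higher degrees add flexibility
  fit <- if (deg == 0) lm(y ~ 1, data = train) else lm(y ~ poly(x, deg), data = train)
  cat(sprintf("deg = %2d | RMSE train = %.3f | RMSE test = %.3f\n",
              deg,
              rmse(train$y, predict(fit)),
              rmse(test$y, predict(fit, newdata = test))))
}
```

The training RMSE necessarily shrinks as the degree grows, while the test RMSE first drops and then rises again once the model starts fitting the noise.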
We use the KNN (k-nearest neighbours) algorithm and compare the model to a multivariate linear regression model. For the KNN, we split our data: 70% for training and 30% for validating the model. Because the algorithm is based on distances, we have to standardize our data; with the Box-Cox transformation, we bring the data closer to a normal distribution. We use set.seed() for reproducibility.
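Because KNN is distance-based, the scale of each predictor matters. A minimal sketch of the standardization step on invented numbers (in our workflow, the model recipe applies Box-Cox, centering, and scaling automatically):

```r
# Sketch: z-score standardization using training-set statistics only
train_x <- c(2.1, 3.5, 4.0, 5.2, 6.8)
test_x  <- c(3.0, 5.5)

mu  <- mean(train_x)
sdv <- sd(train_x)

train_z <- (train_x - mu) / sdv
test_z  <- (test_x - mu) / sdv   # test data must use the training statistics

round(mean(train_z), 10)  # 0
round(sd(train_z), 10)    # 1
```

Note that the test data are transformed with the training mean and standard deviation; computing fresh statistics on the test set would leak information.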
All analyses were carried out with RStudio (version 2022.12.0+353), an open-source IDE for the R language. We use the R package “tidyverse”, among others, to process the data efficiently.
The following code chunk loads all packages we need. The package “conflicted” is particularly important: when functions in different packages share the same name but do not do the same thing, it lets us set explicit preferences.
source("../../R/general/packages.R")
First, we download the data and save it in our repository. We must do this only once: if the file already exists, we read it directly. That is why we implement an if-else statement.
name.of.file <- "../../data/re_ml_01/daily_fluxes.csv"
# If no such file exists, create it
if (!file.exists(name.of.file)){
  # Build the URL as one string (a line break inside the string would break it)
  url <- paste0("https://raw.githubusercontent.com/geco-bern/agds/main/data/",
                "FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv")
  # Read in the data directly from the URL
  daily_fluxes <- read.table(url, header = TRUE, sep = ",")
  # Write a CSV file into the repository
  write_csv(daily_fluxes, name.of.file)
}
# In either case, read the file
daily_fluxes.raw <- read_csv(name.of.file)
If we take a closer look at the file, we see that it contains 334 variables. We cannot continue with our usual procedure because there are too many variables; that is why we have to clean our data first.
We select only the columns we need. Furthermore, some columns contain -9999 as a value; our quality function converts these to NA. Then we use ymd() from the lubridate package to rewrite the date properly. We also only want columns of good quality. For that, we check each selected column against its quality-control column: if the proportion of good measurements is less than 80%, we overwrite the value with NA (and do not drop the row!). After that, we discard all quality-control columns. Now we know that our data set is of high enough quality to perform analyses with it.
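A minimal sketch of this cleaning logic on an invented toy data frame (the actual implementation lives in function.use.good.quality.only.R; the workflow uses lubridate::ymd(), base R is shown here to stay self-contained):

```r
# Sketch of the cleaning steps described above, on a toy data frame
df <- data.frame(
  TIMESTAMP = c(19970101, 19970102, 19970103),
  TA_F      = c(-2.1, -9999, 1.4),   # -9999 codes a missing value
  TA_F_QC   = c(0.95, 0.90, 0.40)    # fraction of good measurements
)

# 1) Recode -9999 as NA
df$TA_F[df$TA_F == -9999] <- NA

# 2) Parse the date (the workflow uses lubridate::ymd() for this)
df$TIMESTAMP <- as.Date(as.character(df$TIMESTAMP), format = "%Y%m%d")

# 3) If the QC fraction is below 0.8, overwrite the value with NA (keep the row)
df$TA_F[df$TA_F_QC < 0.8] <- NA

# 4) Discard the quality-control column
df$TA_F_QC <- NULL
```

After these steps, all three rows are still present, but only values backed by sufficient good measurements remain.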
# Load the function in the file
source("../../R/re_ml_01/function.use.good.quality.only.R")
# Function call to clean the data
daily_fluxes <- use.good.quality.only(daily_fluxes.raw)
# Load the function
source("../../R/general/function.visualize.na.values.R")
# Function call
visualize.na.values.without.groups(daily_fluxes)
Here we split our data into a training subset (70%) and a test subset (30%). After the split, we can fit our models. We use three predictors: SW_IN_F, VPD_F, and TA_F. For KNN, we use k = 8 as a first try; this choice is arbitrary. The last code chunk in this workflow will show which k is the optimal one (we do not want to spoil anything here), which is why you should use k = 8 first. Now we fit a KNN model and an lm model.
# For reproducibility (pseudo-random)
set.seed(123)
# Split 70 % to 30 %
split <- rsample::initial_split(daily_fluxes, prop = 0.7, strata = "VPD_F")
# Split the data in a training set
daily_fluxes_train <- rsample::training(split)
# And a test set
daily_fluxes_test <- rsample::testing(split)
# Load the function in the file.
source("../../R/re_ml_01/function.train.and.test.R")
# Function call for KNN with k = 8
mod.knn <- knn.model(daily_fluxes_train, 8)
# Function call for lm
mod.lm <- lm.model(daily_fluxes_train)
After splitting the data, we have to evaluate the models. The first function call is for the lm model; it takes the model calculated in the code chunk above. Since this is a multivariate regression model, there is nothing to tune in the formula.
The second function call is for the KNN model; it takes the knn model calculated in the code chunk above with k = 8. If we want a model with a different k, we have to recalculate the knn model in that chunk.
# Load the function
source("../../R/re_ml_01/function.evaluation.model.R")
# Function call for linear regression model
eval_model(mod = mod.lm, df_train = daily_fluxes_train, df_test = daily_fluxes_test, c("Davos (lm)"), c("Davos (lm)"))
# Function call for KNN
eval_model(mod = mod.knn, df_train = daily_fluxes_train, df_test = daily_fluxes_test, c("Davos (knn: k=8)"), c("Davos (knn: k=8)"))
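The evaluation compares RMSE, MAE, and RSQ between the training and test data. As a reminder of what these numbers mean, here is a minimal sketch of the three metrics on invented values (the actual plots come from function.evaluation.model.R):

```r
# Sketch: the three metrics used in the model evaluation
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2, na.rm = TRUE))
mae  <- function(obs, pred) mean(abs(obs - pred), na.rm = TRUE)
rsq  <- function(obs, pred) cor(obs, pred, use = "complete.obs")^2

obs  <- c(1.0, 2.0, 3.0, 4.0)
pred <- c(1.1, 1.9, 3.2, 3.8)

mae(obs, pred)   # 0.15
rmse(obs, pred)  # sqrt(0.025), about 0.158
rsq(obs, pred)
```

RMSE penalizes large errors more strongly than MAE, which is why the two can rank models differently when outliers are present.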
Here you can see how the knn model depends on k. If we use k = 1, the training data are fit perfectly, but the RMSE in the test data is high (the model is overfitted). If we use k = n, our model would be underfitted, and therefore the RMSE in the test data would also be high. For a visualization of RSQ and RMSE as functions of k, please take a look at the last figure.
# Load the function into the file
source("../../R/re_ml_01/function.different.k.R")
# Define a sequence for k
my.sequence <- c(1, 8, 25, 50, 100)
# Function call
different.k(my.sequence,daily_fluxes_train,daily_fluxes_test,c("Davos"),c("Davos"))
## k-Nearest Neighbors
##
## 1910 samples
## 8 predictor
##
## Recipe steps: BoxCox, center, scale
## Resampling: None
In this short discussion, we want to highlight the two main aspects of this exercise and discuss each briefly. First, we need to know how to judge the models: why can one model be considered better than another, and how can we estimate the bias-variance trade-off? Second, we need to know how the KNN algorithm works, which requires an understanding of the role of the k neighbours.
In this part, we discuss the differences between an lm model and a KNN model.
To answer this question, we have to take a closer look at the formulas:
An lm model uses all predictors to calculate a multivariate regression model. However, it is a linear regression: the polynomial has degree 1. In the training set, it fits a hyperplane in a space of dimension N (a straight line in the case of a single predictor).
\[ \hat{y}=\hat{a}_0 +\sum_{n=1}^{N}\hat{a}_n x_n \]
Obviously, an lm model does not look for any local pattern: it builds one global linear combination, and the predictors should be linearly independent. If we want to apply the model, we must keep this in mind. We fit the lm model on the training data, but since we only calculate a regression, RSQ = 1 is not achievable (unless the target is itself an exact linear combination of the predictors). The lm model will therefore always have a bias. Most likely, the RMSE and MAE will increase when we make predictions, because the lm model cannot capture non-linear structure in the data.
The situation for KNN is different. KNN does not fit a global parametric function at all. To predict the target at a point \(x\), it averages the target values of the \(k\) nearest training observations:
\[ \hat{y}(x)=\frac{1}{k}\sum_{i \in N_k(x)} y_i \]
where \(N_k(x)\) is the set of the \(k\) training points closest to \(x\) in predictor space.
It still uses all predictors, but only the k nearest neighbours enter each prediction, so the estimate is local. The KNN model can therefore capture a non-linear underlying pattern. When we make predictions, this gives KNN an advantage over the lm model: it adapts to the underlying pattern, and the RMSE and MAE will not increase as fast.
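KNN's prediction rule, averaging the targets of the k nearest training points, can be sketched in a few lines of base R (a one-predictor toy example with invented numbers; the workflow itself works with three standardized predictors):

```r
# Minimal KNN regression: predict by averaging the k nearest training targets
knn_predict <- function(x_train, y_train, x_new, k) {
  sapply(x_new, function(x0) {
    d <- abs(x_train - x0)           # distance in predictor space
    mean(y_train[order(d)[1:k]])     # average the k nearest targets
  })
}

x_train <- c(1, 2, 3, 4, 5)
y_train <- c(1.1, 1.9, 3.2, 3.8, 5.1)

knn_predict(x_train, y_train, x_new = 2.4, k = 1)  # nearest point is x = 2 -> 1.9
knn_predict(x_train, y_train, x_new = 2.4, k = 5)  # mean of all targets -> 3.02
```

With k = 1, the prediction copies the single nearest target; with k equal to the whole training set, it degenerates to the overall mean — the two extremes of the bias-variance trade-off discussed below.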
In our results, the RMSE and MAE of the KNN model are lower than those of the lm model, and the RSQ is higher. The KNN model thus makes predictions with a low bias (low MAE) and low variance (high RSQ) and is better than the lm model, which has a higher bias (higher MAE) and more variance (lower RSQ). It follows that the KNN model has a better bias-variance trade-off than the lm model. But be careful: the RMSE and RSQ of the KNN model depend directly on k.
For the lm model, the variance increases from the training to the test data.
The KNN model seems somewhat overfitted: the RSQ is high in the training data and drops for the test data, and the RMSE increases from the training to the test data. We conclude that the KNN model sits more on the variance side of the bias-variance trade-off.
The lm model behaves more or less the same in both data sets.
It seems that our models predict better for the years between 2005 and 2010. We can also see that the difference between the models is not very big. But the residuals are considerable, because the target only has a magnitude of about $10^{-1}$.
# Load the function into the file
source("../../R/re_ml_01/function.time.variation.R")
source("../../R/re_ml_02/function.vis.residue.R")
# Function call
time.variation(mod.knn, mod.lm, daily_fluxes_test)
# lm, monthly resolution
time.variation.ml.02(mod.lm,daily_fluxes_test,c("lm-Model"), c("Davos"),
c("re_ml_01 (Chapter 9)"))
# KNN, monthly resolution
time.variation.ml.02(mod.knn,daily_fluxes_test, c("KNN-Model"),
                     c("Train: Davos, Test: Davos, k = 8"),
                     c("re_ml_01 (Chapter 9)"))
The “k” in KNN is a hyperparameter that sets the number of nearest neighbours. Here, we use the MAE instead of the RMSE. Although the values differ, the two curves run in parallel: both MAE and RMSE are U-shaped functions of k and have their global minima at the same k.
Hypothesis:
The lower we choose k, the higher the RSQ in the training data, but the higher the variance in the test data, and therefore the higher the MAE.
Explanation: Let k = 1. Then the training data would be predicted perfectly, because only the single nearest neighbour (the training point itself) is taken into account. The bias would be zero, therefore RSQ = 1, and the MAE is 0 as well. But the model would be totally overfitted. That means there is a bias-variance trade-off: if the bias in the training data is low, then the variance in the test data is, in general, high.
Let k = N. Then RSQ is an element of (0, 1) and the MAE an element of [0, inf), but we do not know where exactly. However, the model tends to be underfitted because it takes all training points into account: the prediction is simply the mean of all training observations.
Conclusion: We would like a model with zero bias and zero variance. But this is impossible, so we have to find the model with the best bias-variance trade-off.
The hypothesis is not fully true. RSQ is a quantity that describes the variance, or better, how much of the target's variance the model explains. For the best bias-variance trade-off, we want the k where the bias in the test data is lowest (low MAE) and the variance is low as well (high RSQ). The model for k = 1 is totally overfitted: in the training data, the RSQ is 1, so the bias must be 0, but in the test data the MAE is at its global maximum. The model is useless for any predictions; the trade-off lies on the side of too much variance.
For k = 200, the RSQ is lowest in the training data. The MAE increases for both data sets but is not at a maximum. The model is not suitable for predictions because the trade-off lies on the side of too much bias.
The optimal k is where the trade-off is best. Generalisability means that we want the model with the lowest MAE in the test data (low bias) and the highest RSQ (low variance). For that, we mark the minimum of the test-data error curve (green point, k = 25). Unfortunately, k = 25 is not necessarily the best model; so far we have only seen that the optimum lies between 20 and 30. We make another function call using every k between 20 and 30. Now we see that the optimal k is 25, and we can choose this model.
Important: here we set a seed and split our data only pseudo-randomly. Without set.seed(123), a different k could turn out to be optimal.
# Load the function into the file
source("../../R/re_ml_01/function.parameter.extracter.R")
# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5))
# Visualize the MAE and RSQ --> optimal k is 25 (between 20:30)
parameter.extracter(my.sequence, daily_fluxes_train, daily_fluxes_test)
# Visualize MAE and RSQ for k = 20:30 --> it is 25
parameter.extracter(c(20:30), daily_fluxes_train, daily_fluxes_test)